Abstract

Privacy is a fundamental human right, and the constitutions of many countries mention the right to privacy. However, the development of computer science is threatening that right. Security and privacy measures are essential for the safe and secure utilization of data. The main difference between privacy and security is who the attacker is. For security, data must be protected from outsiders and delivered correctly to authorized recipients. Privacy, on the other hand, must take into account that even authorized recipients can be attackers, because data subjects retain rights over their data. Therefore, it is important to strike a balance between privacy and utility, and technologies are needed to achieve both requirements.

Various privacy protection techniques have been proposed, and we focus on de-identification techniques. De-identification can proceed in several directions; in this dissertation we broadly divide them into de-identification techniques for quasi-identifiers, de-identification techniques for sensitive attributes, and de-identification techniques based on information theory. k-anonymization is a typical de-identification method for quasi-identifiers and has been adopted in various fields because it preserves overall trends in the data. k-anonymization can be achieved by techniques such as generalization with hierarchical trees, swapping, clustering, and top/bottom coding, and a great number of k-anonymization algorithms have been proposed. However, this class of methods may leave sensitive attributes vulnerable. Therefore, de-identification methods for sensitive attributes, such as l-diversification, have been proposed. These methods are difficult to achieve for real data and can significantly reduce its usefulness. Furthermore, when de-identified data is published, various attackers can be assumed, and the de-identification methods and privacy metrics corresponding to these attacker models remain fragmented. The impact of combining de-identification techniques has not been well investigated, although combining them is believed to provide a better balance between privacy and utility. In fact, much publicly available data incorporates multiple de-identification techniques. In this context, k-anonymity is still a widely accepted metric because its concept and metric are simple and straightforward, and the privacy of de-identified data is further protected by legal deterrents and systems designed with privacy by design in mind. On the other hand, researchers have proposed de-identification techniques based on information theory to provide rigorous, provable privacy guarantees. Differential privacy is a prime example, and various mechanisms have been proposed to date. Differential privacy adds probabilistic noise to the output result, or to all data, which may compromise the properties of the original data except for the necessary information, e.g., the results of a particular analysis. Therefore, it is necessary to design ad-hoc mechanisms for different use cases. Privacy and utility are in a trade-off relationship, and it is difficult to resolve them all completely.
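As a concrete illustration of the quasi-identifier direction, the sketch below generalizes two quasi-identifiers (age and ZIP code) with simple hierarchies until every record shares its quasi-identifier combination with at least k-1 others. The records, hierarchies, and parameters are hypothetical and only illustrate the idea, not the specific algorithms evaluated in this dissertation.

```python
from collections import Counter

# Hypothetical records: (age, ZIP code, diagnosis); diagnosis is the sensitive attribute.
records = [(34, "13053", "flu"), (36, "13068", "flu"),
           (47, "14850", "cancer"), (49, "14853", "cold")]

def generalize(record, level):
    """Generalize the quasi-identifiers: widen the age bucket and truncate the ZIP code."""
    age, zip_code, diagnosis = record
    width = 10 * (level + 1)                       # age bucket width grows with the level
    low = (age // width) * width
    age_range = f"{low}-{low + width - 1}"
    zip_general = zip_code[: max(0, 5 - level)] + "*" * min(5, level)
    return (age_range, zip_general, diagnosis)

def k_anonymize(records, k):
    """Raise the generalization level until every quasi-identifier group has at least k records."""
    for level in range(6):
        table = [generalize(r, level) for r in records]
        groups = Counter((age, z) for age, z, _ in table)
        if min(groups.values()) >= k:
            return level, table
    return None, None

level, table = k_anonymize(records, k=2)
print(f"2-anonymous at generalization level {level}:")
for row in table:
    print(row)
```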

In this study, we examine privacy-preserving techniques that focus mainly on the utility of data and on making the strength of privacy easy to understand. We therefore define privacy strength as the probability that an individual is re-identified, which is a generalization of the most widely known notion, k-anonymity. However, it is difficult to design a privacy metric that is consistent across data formats and types. Privacy-sensitive data can be divided into structured and unstructured data, and each can further be divided into static and dynamic data. Unstructured data such as documents is difficult to handle, and few existing studies address it. Moreover, since dynamic data has more dimensions than static data, the curse of dimensionality becomes a serious problem. Therefore, de-identification techniques also need to address privacy and utility with respect to the data format.
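The assumed relationship between the two notions can be written as follows, where E(t) denotes the equivalence class of records sharing a target t's quasi-identifier values in the released table (this is a restatement of the usual interpretation of k-anonymity, not a new result):

```latex
P_{\text{re-id}}(t) \;=\; \frac{1}{|E(t)|} \;\le\; \frac{1}{k},
\qquad \text{since } |E(t)| \ge k \text{ for every record under } k\text{-anonymity}.
```

Reporting the re-identification probability directly therefore preserves the familiar 1/k interpretation even when the data is de-identified by methods other than generalization.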

Privacy-preserving techniques include not only de-identification techniques but also secure computation. Secure computation allows multiple data holders to compute a specific function while keeping each other's inputs secret. This allows data held by multiple institutions to be integrated and analyzed. It also allows an institution with data to outsource its analysis to another institution, which leads to secure data utilization. The main issues with secure computation are its computation and communication costs. Furthermore, since secure computation produces exact outputs, private information may still leak from the output value. Therefore, in secure computation, the privacy of the output must be protected, or the output must be available only to the data holder.

This dissertation presents a study on utility-aware privacy-preserving techniques and focuses on de-identification techniques. In particular, we aim for easy-to-understand metrics of privacy strength and usefulness in order to gain data subjects' understanding when they provide their data. First, we start with the basic idea of privacy in Chapter 1 and explain the need for privacy-preserving techniques. Then, in Chapter 2, we introduce the basic privacy-preserving techniques and related techniques used in our proposals. In Chapter 3, we describe several related studies on privacy-preserving techniques and discuss the remaining issues.

In terms of data format, we propose de-identification methods for structured static, structured dynamic, and unstructured static data. Throughout this dissertation, we consistently treat the probability that an individual is re-identified as the strength of privacy, but give concrete privacy metrics that depend on the format of the data.

In Chapter 4, we focus on structured static data, which is the most common type of data. We consider three types of de-identification methods, namely generalization, noise addition, and sampling, and experimentally evaluate de-identified data produced by combining these methods. Combining de-identification methods is easy and is often done to generate de-identified data flexibly, but there is no metric for evaluating the privacy strength of such data. Since it is difficult to evaluate the privacy of data de-identified by a combination of methods directly, we conduct an evaluation using attack simulations. This evaluation is effective when the system is designed with privacy in mind and the data is properly managed. Since the proposed privacy metric is equivalent to k-anonymity, it is expected to facilitate data subjects' understanding of how their data is used. In addition, we propose a utility metric for de-identified data, assuming that the data will be used as training data for machine learning. This idea of simulation-based privacy and utility metrics can be applied to any type of data. The results of our experiments show that data de-identified by the combined method retains more utility than data de-identified by a single method at the same level of privacy.
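The evaluation idea can be sketched as follows: de-identify the data with a combination of generalization, noise addition, and sampling, let a simulated attacker link each released record back to its closest original record, and read the empirical success rate as a re-identification probability (an effective 1/k). The synthetic data, parameters, and nearest-neighbour attacker below are hypothetical placeholders, not the actual attack simulations of Chapter 4.

```python
import random

# Hypothetical numeric quasi-identifier records; one row per individual.
random.seed(0)
original = [[random.uniform(0, 100) for _ in range(3)] for _ in range(200)]

def de_identify(data, noise_scale=5.0, sampling_rate=0.7, bucket=10):
    """Combine generalization (bucketing), noise addition, and sampling."""
    released, kept_ids = [], []
    for i, row in enumerate(data):
        if random.random() > sampling_rate:                          # sampling: drop some records
            continue
        noisy = [v + random.gauss(0, noise_scale) for v in row]      # noise addition
        generalized = [bucket * round(v / bucket) for v in noisy]    # generalization
        released.append(generalized)
        kept_ids.append(i)
    return released, kept_ids

def simulate_attack(original, released, kept_ids):
    """Attacker links each released record back to the closest original record."""
    hits = 0
    for rel_row, true_id in zip(released, kept_ids):
        guess = min(range(len(original)),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(original[j], rel_row)))
        hits += (guess == true_id)
    return hits / len(released)

released, kept_ids = de_identify(original)
rate = simulate_attack(original, released, kept_ids)
print(f"empirical re-identification rate: {rate:.2f} (effective k ~ {1 / max(rate, 1e-9):.1f})")
```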

In Chapter 5, we investigate de-identification methods for structured dynamic data. Dynamic data must take the time factor into account, and balancing privacy and utility is an especially difficult problem. One direction for de-identifying dynamic data is to modify pseudonymized attributes, which eliminates the association between data from different times for the same person. However, because individuals tend to behave similarly over time, it can be difficult to completely break the connection between data even if different pseudonyms are used. Therefore, we assume a re-identification attack and a linkability attack on dynamic data. In this chapter, we treat dynamic data as a matrix and propose a de-identification method based on matrix manipulation. As in the case of static data, we use a privacy metric equivalent to k-anonymity and a utility metric that takes actual use into account. In our experiments, we conducted an evaluation using real web access log data and showed that the proposed method maintains utility better than conventional de-identification methods. Furthermore, we showed that even with different pseudonymization processes at different times, i.e., treating the same person at different times as different people, it is still possible to identify individuals with high probability. We also confirmed that the privacy risk against linkability attacks can be significantly reduced with a very small amount of data processing.
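The linkability attack assumed here can be illustrated with a toy simulation: because individuals behave similarly over time, an attacker can match a pseudonym from one period to the most similar activity vector from another period. The synthetic profiles and cosine-similarity matcher below are illustrative assumptions, not the web access log data or the matrix-based method evaluated in Chapter 5.

```python
import random

random.seed(1)
N_USERS, N_ITEMS = 50, 30

# Hypothetical access profiles: each user has a stable preference vector,
# observed with noise in two time periods under unrelated pseudonyms.
profiles = [[random.random() for _ in range(N_ITEMS)] for _ in range(N_USERS)]

def observe(profile):
    return [max(0.0, p + random.gauss(0, 0.2)) for p in profile]

period_a = [observe(p) for p in profiles]            # pseudonyms of period A
period_b = [observe(p) for p in profiles]            # fresh pseudonyms of period B

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv + 1e-12)

# Linkability attack: match each period-B pseudonym to the most similar period-A pseudonym.
correct = sum(
    1 for true_id, vb in enumerate(period_b)
    if max(range(N_USERS), key=lambda j: cosine(period_a[j], vb)) == true_id
)
print(f"pseudonyms linked correctly: {correct}/{N_USERS}")
```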

In Chapter 6, we investigate de-identification techniques for unstructured data, mainly documents. Since it is difficult to directly assess the privacy and utility of unstructured data, we start by structuring the data through morphological analysis. We need to consider that anyone can be an attacker, because documents such as court records or reports of accidents that occur in schools can be publicly available. Perturbing the data as in differential privacy may not be appropriate, since documents need to remain readable and understandable to people. Therefore, we propose an attacker model and an attack algorithm that assume a very powerful attacker with access to external databases. Documents are generally de-identified manually, and many existing studies use the closeness of their output to manually de-identified documents as a privacy metric. In contrast, we consider that privacy risks still lurk in manually de-identified documents. In our experiments, we applied the proposed attack algorithm to real manually de-identified documents and confirmed that we could actually re-identify individuals from them. Having shown that the proposed attack algorithm is effective, we propose an algorithm that processes the risky words exploited by the attack, which could otherwise lead to the identification of individuals. Our algorithm prevents strong attacks using web search with minimal data processing and thus maintains utility. As described above, we have shown that our simulation-based attacks and countermeasures can be applied to various data formats with k-anonymization in mind, which is easy for data subjects to understand. All of our proposed de-identification methods maintain a high level of utility compared to simple de-identification methods.
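The attacker model can be sketched as follows: every word left in a de-identified document narrows down, via an external database such as a web search index, the set of individuals the document can refer to; if the intersection shrinks to a single person, re-identification succeeds, and the countermeasure masks the most identifying words until at least k candidates remain. The tiny hand-made external database and greedy masking rule below are illustrative assumptions, not the actual algorithm or documents studied in Chapter 6.

```python
# Hypothetical external database: which individuals each term is associated with.
external_db = {
    "tennis club": {"alice", "bob", "carol"},
    "broken arm":  {"alice", "dave"},
    "june 2019":   {"alice", "bob", "eve"},
}
document_terms = ["tennis club", "broken arm", "june 2019"]

def candidate_set(terms):
    """Attacker: intersect the individuals consistent with every term in the document."""
    candidates = None
    for t in terms:
        if t in external_db:
            candidates = external_db[t] if candidates is None else candidates & external_db[t]
    return candidates or set()

def mask_risky_terms(terms, k=2):
    """Countermeasure: greedily mask terms until at least k candidates remain."""
    terms = list(terms)
    while len(candidate_set(terms)) < k and any(t in external_db for t in terms):
        # mask the term whose removal enlarges the candidate set the most
        best = max((t for t in terms if t in external_db),
                   key=lambda t: len(candidate_set([x for x in terms if x != t])))
        terms[terms.index(best)] = "[MASKED]"
    return terms

print("attacker candidates:", candidate_set(document_terms))   # a single person -> re-identified
print("after masking:", mask_risky_terms(document_terms))
```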

In the following chapters, we investigate other privacy-preserving techniques as directions for further research.

In Chapter 7, we focus on differential privacy mechanisms. Differential privacy requires calculating a sensitivity that depends on the query. We define a differential privacy model with dummy data in order to handle cases where calculating the sensitivity is difficult. The only difference from the original definition is that we do not consider arbitrary data, and various existing differential privacy mechanisms satisfy the proposed definition. As a concrete example, we design a differential privacy mechanism whose query is the calculation of the t-value in a t-test. The impact of the proposed mechanism on the p-value is large, and the privacy parameter needs to be large to avoid type I errors. Alternatively, instead of performing the t-test directly, the problem must be reformulated to reduce the sensitivity and come closer to the correct result; this discussion is ongoing.
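For context, the textbook Laplace mechanism below calibrates its noise to the query sensitivity divided by the privacy parameter epsilon. This is straightforward for a counting query, whose sensitivity is a constant 1, but hard for a t-statistic, whose sensitivity depends on the data itself; that difficulty is what motivates the dummy-data formulation in this chapter. The sketch shows only the standard mechanism, not the proposed model.

```python
import random

def laplace_noise(scale):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def dp_count(data, predicate, epsilon):
    """Counting query under epsilon-differential privacy: a count has sensitivity 1."""
    true_count = sum(1 for x in data if predicate(x))
    return true_count + laplace_noise(1.0 / epsilon)   # scale = sensitivity / epsilon

# A count is easy to protect because its sensitivity is fixed; a t-statistic, by contrast,
# has data-dependent sensitivity and cannot be calibrated this simply.
data = [random.gauss(50, 10) for _ in range(1000)]
print("noisy count of values > 60:", dp_count(data, lambda x: x > 60, epsilon=0.5))
```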

In Chapter 8, we focus on basic secure comparison protocols, well known as the millionaires' problem. A comparison protocol is a basic two-party protocol for comparing two values in secret, and it is very important because it is widely used in data mining and machine learning. Even for comparing two values, there are numerous settings depending on who holds the data, in what state it is held (e.g., encrypted or secret shared), and who obtains the output. We systematize this classification and then describe conversions among these types of protocols. Many of these conversions have been addressed in existing studies, and those types and conversions are covered in Chapters 2 and 3. These conversions allow us to construct a comparison protocol for one setting by converting an existing protocol from a different setting. Moreover, we propose a base protocol; using it in combination with the above-mentioned conversions allows us to obtain an efficient comparison protocol for any of the configurations captured in our taxonomy. Finally, we implemented the proposed protocol and compared it with previous work. The results show that the proposed protocol outperforms the most efficient existing protocol in terms of computation time by taking advantage of its parallel processing capability.
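To make the idea of such a taxonomy concrete, the sketch below encodes one plausible way to classify comparison settings by the state of each input and the recipient of the output. The axes and labels are illustrative assumptions; the exact classification used in Chapter 8 may differ.

```python
from dataclasses import dataclass
from enum import Enum

class InputState(Enum):
    PLAINTEXT = "held in the clear by its owner"
    ENCRYPTED = "encrypted under some party's key"
    SHARED = "secret shared between the parties"

class OutputRecipient(Enum):
    ALICE = "only Alice learns the comparison bit"
    BOB = "only Bob learns the comparison bit"
    BOTH = "both parties learn the bit"
    SHARED = "the bit itself stays secret shared"

@dataclass(frozen=True)
class ComparisonSetting:
    """One cell of the taxonomy: where each input lives and who gets the result."""
    input_a: InputState
    input_b: InputState
    output: OutputRecipient

# The classic millionaires' problem: each party holds its own value in the clear
# and both learn which value is larger.
millionaires = ComparisonSetting(InputState.PLAINTEXT, InputState.PLAINTEXT, OutputRecipient.BOTH)
print(millionaires)
```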

Finally, we conclude by summarizing our results and future work, and provide a direction for preserving privacy.
